This is a notebook that will show how to do some very basic frequentist inferential statistics in R - a binomial test, a Chi Square test, and an independent two group t-test. This is by no means an exhaustive exercise in inferential statistics, and the data we’re using was collected to illustrate the use of the tests rather than reveal some heretofore undiscovered truth about humanity.
However, it’s worth noting that these are the kinds of tests you might do with your own data in an independent project. Applying the concepts here to your own data later on (even if it’s e.g., corpus data and not experimental) will go a long way toward doing well on your independent project next year.
We did a survey of three very simple questions:
You can find the survey here. I’ve declared this endeavor Prawn Gif because it’s mainly about the pronunciation of the word gif (abbreviated as PronGif - for me, Pron and Prawn are homophones). You can find the notebook, data, etc here on RStudio Cloud.
First, a reminder about how this works. This is a compiled RMarkdown file I’ve made into a webpage so you can follow along. These are like the instructions. However, to implement these, you need to open the PrawnGif.Rmd file in RStudio. Here’s the link to the RStudio Project. Remember to start by saving a permanent copy. You should now have three tabs open:
We’ll start by reading in the data; but first we want to do some preliminary work on the raw data to get it into the right format.
#Anything after a hashtag in a code block is a comment
#These don't do anything (the hashtag is a cue to R to ignore it)
# but they provide useful context for the code
#This reads all of our data from the tsv file into a data frame called "dem" (short for demonstration)
dem<-read.csv("DataDem.tsv", header=TRUE, sep="\t")
#The head() command shows us the first few rows of our new dataframe - this should look exactly like the spreadsheet.
head(dem)
## Timestamp PronGif PrefSocial SickHome
## 1 3/23/2020 19:04:13 gig Twitter 14
## 2 3/23/2020 19:05:17 giant Twitter 0
## 3 3/23/2020 19:05:36 gig Twitter 40
## 4 3/23/2020 19:05:47 gig Twitter 50
## 5 3/23/2020 19:06:33 gig Twitter 20
## 6 3/23/2020 19:06:42 gig Twitter 0
What does the data look like? We know from the original summary within the survey that most people prefer a voiced velar stop (as in gig, ~77%) over an affricate (as in giant, ~22%) when pronouncing gif, and they prefer Twitter (~80%) over Facebook (~20%).
First, let’s think about these basic results and what they mean:

- People in our sample prefer a voiced velar stop over an affricate for the first sound in the word gif.
- People in our sample prefer Twitter to Facebook.
Now, we have to start by thinking about these results in the context of our study. The first result is pretty interesting, but the second one is less so: this is probably a sampling issue due to my personal social network sizes on the two platforms (most of the participants came from me just asking people who followed me on these platforms to fill out the survey).
Still, what we don’t know is how these intersect - in other words, how users of Facebook pronounce gif relative to users of Twitter, so let’s have a look at this. First, let’s decide how we want to visualise this - we can sketch out what we’re looking for.
#The plotting functions below come from the ggplot2 package
library(ggplot2)
p<-ggplot(data=dem, aes(x=PronGif,fill=PrefSocial))+
geom_bar(position="fill")+
ylab("Proportion")+
xlab("Pronunciation Preference")+
theme_bw()
p
We also haven’t looked at whether people’s fatigue with being home varies meaningfully between pronunciation styles or social media network preferences, so let’s have a look at that too:
p2<-ggplot(data=dem,aes(x=PronGif,y=SickHome))+
geom_violin(aes(fill=PronGif))+
xlab("Pronunciation Preference")+
ylab("How sick are you of being at home?")+
theme_bw()
p2
p3<-ggplot(data=dem,aes(x=PrefSocial,y=SickHome))+
geom_violin(aes(fill=PrefSocial))+
xlab("Preferred Social Media Network")+
ylab("How sick are you of being at home?")+
theme_bw()
p3
The binomial test is one of the simplest tests out there. It takes two categories and determines if the counts in those categories differ from what we would consider to be the null hypothesis. In this case, we’ll use 50/50. In other words, our null hypothesis is that people have no particular preference between the velar stop (gig) and affricate (giant) pronunciations of gif.
We use the binom.test() function.
#table(dem$PronGif) shows the counts per category:
#giant gig
# 53 160
binom.test(160,213,0.5)
##
## Exact binomial test
##
## data: 160 and 213
## number of successes = 160, number of trials = 213, p-value = 1.127e-13
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.6875007 0.8077126
## sample estimates:
## probability of success
## 0.7511737
This shows (as we could have guessed from a look at the proportions) that this data is very unlikely to come from a population that has no particular preference in pronunciation - in fact, there is only about a 0.00000000001% chance (p = 1.127e-13) that we would get a split at least as lopsided as 160 velar stops to only 53 affricates by sampling a population that had no particular preference in how they pronounced gif.
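Rather than typing the counts in by hand, we can pull them straight out of the data frame. This is a small sketch (assuming the same `dem` data frame and column names as above) that should give the identical test:

```r
#Count how many responses fall into each pronunciation category
counts <- table(dem$PronGif)

#Run the same binomial test, treating "gig" responses as successes
#and the total number of responses as the number of trials
binom.test(counts["gig"], sum(counts), p = 0.5)
```

Doing it this way means the test stays correct even if the data changes (e.g., if more survey responses come in).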
A chi square test will tell us whether two groups differ from each other in terms of their response or membership for some other categorical variable. For our data, we can look at whether people who prefer Facebook differ in terms of how they prefer to pronounce gif from people who prefer Twitter. The null hypothesis would be that we won’t find any difference in pronunciation between the two groups.
First, we need to convert this to a contingency table that gives us the number of Facebook users that prefer giant and gig, and likewise for Twitter:
demTab<-table(dem$PronGif,dem$PrefSocial)
demTab
##
## Facebook Twitter
## giant 12 41
## gig 31 129
Our null hypothesis here would be that there is no difference in pronunciation of the word gif across different social media networks. The chi square test is going to tell us if the proportion of giant pronouncers is more or less the same across Facebook and Twitter (i.e., comes from a null hypothesis world where preferred social media network has nothing to do with gif pronunciation), or if it is different across Facebook and Twitter (comes from a world that would support our hypothesis that there is some systematic difference here).
chisq.test(demTab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: demTab
## X-squared = 0.099888, df = 1, p-value = 0.752
These results, p=0.752, mean that if there were no systematic difference in pronunciation across groups in the population, we would see a difference at least this large about 75.2% of the time. That chance is far too high for us to reject the null hypothesis. In other words, which social media network you prefer probably has nothing to do with how you pronounce gif.
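One quick way to see why the test comes out this way is to look at the column proportions in the contingency table. This is a small sketch using the `demTab` table built above:

```r
#Proportion of each pronunciation within each social media network
#(margin = 2 computes proportions down the columns)
prop.table(demTab, margin = 2)
```

The share of giant pronouncers is similar on both platforms (about 28% on Facebook vs 24% on Twitter), which is why the test finds no evidence of a difference.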
Here we’ll perform an independent two group t-test. The null hypothesis here would be that the mean of a continuous variable (in this case SickHome) does not differ between two groups (in this case, either Facebook and Twitter users or gig/giant pronouncers). There are other kinds of t-tests (and relevant detail about one-tailed vs two-tailed hypotheses) that we won’t cover here, but note that t-tests always deal with differences in means, and therefore require at least one continuous variable.
# independent 2-group t-test
pronunciation<-t.test(dem$SickHome~dem$PronGif) # format is t.test(x~y), where x is numeric and y is a binary categorical
socialmedia<-t.test(dem$SickHome~dem$PrefSocial)
pronunciation
##
## Welch Two Sample t-test
##
## data: dem$SickHome by dem$PronGif
## t = -0.094184, df = 90.717, p-value = 0.9252
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.540219 9.585973
## sample estimates:
## mean in group giant mean in group gig
## 43.91038 44.38750
socialmedia
##
## Welch Two Sample t-test
##
## data: dem$SickHome by dem$PrefSocial
## t = 1.6789, df = 64.679, p-value = 0.098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.750482 20.209073
## sample estimates:
## mean in group Facebook mean in group Twitter
## 51.63488 42.40559
The test for SickHome~PronGif has a very high p-value (p > 0.9), meaning that even if velar stop gif pronouncers were no more or less sick of being at home than affricate pronouncers, we would see a difference in means at least this large over 90% of the time.
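As a quick cross-check on the group means reported in the test output, we can compute them directly from the data frame (a small sketch assuming the same `dem` columns as above):

```r
#Mean of SickHome within each pronunciation group
aggregate(SickHome ~ PronGif, data = dem, FUN = mean)

#And within each preferred social media network
aggregate(SickHome ~ PrefSocial, data = dem, FUN = mean)
```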
The second test, for SickHome~PrefSocial, has a much lower p-value of 0.098, but this is still higher than the usual critical \(\alpha\) = 0.05. Looking at the mean in-group estimates for Facebook and Twitter at the bottom of the test output, we can see that Facebook users are slightly more sick (51.6) of being at home than Twitter users (42.4). The p-value means that there is about a 10% chance we could have gotten a difference at least this large from a population where there is actually no difference at all; usually researchers consider this chance to be too high, and so can’t reject the null hypothesis. However, some fields use an \(\alpha\) = 0.1, and others who use \(\alpha\) = 0.05 might refer to values of p < 0.1 as “marginally significant”, meaning they are suggestive of support for the hypothesis, but inconclusive (further work is needed).
Something that requires caution is that we perhaps did a bit of digging here: if we test every categorical variable we can think of against SickHome (of course, we only had two) we’ll eventually find one that looks like it’s systematically different across groups. Remember that the p-value represents the probability that you would find a result at least as extreme as the one you did, given that the null hypothesis is true, and the more you query the same data, the more likely you become to accidentally find support for your hypothesis even though the null hypothesis shouldn’t be rejected (i.e., you become more prone to Type I errors).
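One common safeguard against this multiple comparisons problem is to adjust the p-values for the number of tests you ran. This is a small sketch using R’s built-in p.adjust() with the two t-test p-values reported above:

```r
#The p-values from our two t-tests
pvals <- c(pronunciation = 0.9252, socialmedia = 0.098)

#Bonferroni correction: multiply each p-value by the number of tests
#(capped at 1), making it harder to reach significance by chance
p.adjust(pvals, method = "bonferroni")
#socialmedia becomes 0.196; pronunciation is capped at 1
```

After correction, the SickHome~PrefSocial result looks even less convincing, which is consistent with our decision not to reject the null hypothesis.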